{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\nDeploy a Quantized Model on CUDA\n================================\n**Author**: `Wuwei Lin <https://github.com/vinx13>`_\n\nThis article is an introductory tutorial on automatic quantization with TVM.\nAutomatic quantization is one of the quantization modes in TVM. More details on\nthe quantization story in TVM can be found\n`here <https://discuss.tvm.apache.org/t/quantization-story/3920>`_.\nIn this tutorial, we will import a model pre-trained on ImageNet from the\nGluon model zoo into Relay, quantize the Relay model, and then run inference.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import tvm\nfrom tvm import te\nfrom tvm import relay\nimport mxnet as mx\nfrom tvm.contrib.download import download_testdata\nfrom mxnet import gluon\nimport logging\nimport os\n\nbatch_size = 1\nmodel_name = \"resnet18_v1\"\ntarget = \"cuda\"\ndev = tvm.device(target)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Prepare the Dataset\n-------------------\nThis section shows how to prepare the calibration dataset for quantization.\nWe first download the ImageNet validation set, then define an iterator that\nresizes each image and normalizes it with the per-channel mean and standard\ndeviation.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "calibration_rec = download_testdata(\n    \"http://data.mxnet.io.s3-website-us-west-1.amazonaws.com/data/val_256_q90.rec\",\n    \"val_256_q90.rec\",\n)\n\n\ndef get_val_data(num_workers=4):\n    mean_rgb = [123.68, 116.779, 103.939]\n    std_rgb = [58.393, 57.12, 57.375]\n\n    def batch_fn(batch):\n        return batch.data[0].asnumpy(), batch.label[0].asnumpy()\n\n    img_size = 299 if model_name == \"inceptionv3\" else 224\n    val_data = mx.io.ImageRecordIter(\n        path_imgrec=calibration_rec,\n        preprocess_threads=num_workers,\n        shuffle=False,\n        batch_size=batch_size,\n        resize=256,\n        data_shape=(3, img_size, img_size),\n        mean_r=mean_rgb[0],\n        mean_g=mean_rgb[1],\n        mean_b=mean_rgb[2],\n        std_r=std_rgb[0],\n        std_g=std_rgb[1],\n        std_b=std_rgb[2],\n    )\n    return val_data, batch_fn"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The calibration dataset should be an iterable object. We define it as a\nPython generator that yields one dictionary per batch, mapping the input\nname `data` to a NumPy array. In this tutorial, we use only a few samples\n(`calibration_samples`) for calibration.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "calibration_samples = 10\n\n\ndef calibrate_dataset():\n    val_data, batch_fn = get_val_data()\n    val_data.reset()\n    for i, batch in enumerate(val_data):\n        if i * batch_size >= calibration_samples:\n            break\n        data, _ = batch_fn(batch)\n        yield {\"data\": data}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Import the Model\n----------------\nWe use the Relay MXNet frontend to import a model from the Gluon model zoo.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def get_model():\n    gluon_model = gluon.model_zoo.vision.get_model(model_name, pretrained=True)\n    img_size = 299 if model_name == \"inceptionv3\" else 224\n    data_shape = (batch_size, 3, img_size, img_size)\n    mod, params = relay.frontend.from_mxnet(gluon_model, {\"data\": data_shape})\n    return mod, params"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Quantize the Model\n------------------\nIn quantization, we need to find the scale for each weight and intermediate\nfeature map tensor of each layer.\n\nFor weights, the scales are calculated directly from the values of the\nweights. Two modes are supported: `power2` and `max`. Both modes first find\nthe maximum value within the weight tensor. In `power2` mode, the maximum\nis rounded down to a power of two. If the scales of both weights and\nintermediate feature maps are powers of two, we can use bit shifting for\nmultiplications, which makes it computationally more efficient. In `max` mode,\nthe maximum is used as the scale. Without rounding, `max` mode might have\nbetter accuracy in some cases. When the scales are not powers of two,\nfixed-point multiplications are used.\n\nFor intermediate feature maps, we can find the scales with data-aware\nquantization. Data-aware quantization takes a calibration dataset as the\ninput argument. Scales are calculated by minimizing the KL divergence between\nthe distributions of activations before and after quantization.\nAlternatively, we can use a pre-defined global scale. This saves the\ncalibration time, but accuracy might suffer.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def quantize(mod, params, data_aware):\n    if data_aware:\n        with relay.quantize.qconfig(calibrate_mode=\"kl_divergence\", weight_scale=\"max\"):\n            mod = relay.quantize.quantize(mod, params, dataset=calibrate_dataset())\n    else:\n        with relay.quantize.qconfig(calibrate_mode=\"global_scale\", global_scale=8.0):\n            mod = relay.quantize.quantize(mod, params)\n    return mod"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Run Inference\n-------------\nWe create a Relay VM to build and execute the model.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def run_inference(mod, num_samples=10):\n    # Build the quantized module and wrap it in a callable Relay VM executor.\n    model = relay.create_executor(\"vm\", mod, dev, target).evaluate()\n    val_data, batch_fn = get_val_data()\n    for i, batch in enumerate(val_data):\n        if i >= num_samples:  # only run inference on a few samples in this tutorial\n            break\n        data, label = batch_fn(batch)\n        prediction = model(data)\n\n\ndef main():\n    mod, params = get_model()\n    # Calibrate activation scales on the dataset (KL divergence mode).\n    mod = quantize(mod, params, data_aware=True)\n    run_inference(mod)\n\n\nif __name__ == \"__main__\":\n    main()"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}